A Fast Quartet Tree Heuristic for Hierarchical Clustering
The Minimum Quartet Tree Cost problem is to construct an optimal weight tree
from the 3·(n choose 4) weighted quartet topologies on n objects, where
optimality means that the summed weight of the embedded quartet topologies is
optimal (so it can be the case that the optimal tree embeds all quartets as
nonoptimal topologies). We present a Monte Carlo heuristic, based on randomized
hill climbing, for approximating the optimal weight tree, given the quartet
topology weights. The method repeatedly transforms a dendrogram, with all
objects involved as leaves, achieving a monotonic approximation to the exact
single globally optimal tree. The problem and the solution heuristic have been
extensively used for general hierarchical clustering of nontree-like
(non-phylogeny) data in various domains and across domains with heterogeneous
data. We also present a greatly improved heuristic, reducing the running time
by a factor of order a thousand to ten thousand. All this is implemented and
available, as part of the CompLearn package. We compare performance and running
time of the original and improved versions with those of UPGMA, BioNJ, and NJ,
as implemented in the SplitsTree package on genomic data for which the latter
are optimized.
Keywords: Data and knowledge visualization, Pattern
matching--Clustering--Algorithms/Similarity measures, Hierarchical clustering,
Global optimization, Quartet tree, Randomized hill-climbing
Comment: LaTeX, 40 pages, 11 figures; this paper has substantial overlap with
arXiv:cs/0606048 in cs.D
Normalized Web Distance and Word Similarity
There is a great deal of work in cognitive psychology, linguistics, and
computer science, about using word (or phrase) frequencies in context in text
corpora to develop measures for word similarity or word association, going back
to at least the 1960s. The goal of this chapter is to introduce the
normalized web distance (NWD) method to determine similarity between words and
phrases. It is a general way to tap the amorphous low-grade knowledge available
for free on the Internet, typed in by local users aiming at personal
gratification of diverse objectives, and yet globally achieving what is
effectively the largest semantic electronic database in the world. Moreover,
this database is available for all by using any search engine that can return
aggregate page-count estimates for a large range of search-queries. In the
paper introducing the NWD it was called `normalized Google distance (NGD),' but
since Google doesn't allow computer searches anymore, we opt for the more
neutral and descriptive NWD.
Comment: Latex, 20 pages, 7 figures, to appear in: Handbook of Natural
Language Processing, Second Edition, Nitin Indurkhya and Fred J. Damerau
Eds., CRC Press, Taylor and Francis Group, Boca Raton, FL, 2010, ISBN
978-142008592
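As a rough illustration of how aggregate page counts could be turned into a distance, the sketch below uses the commonly cited NWD formula, NWD(x, y) = (max{log f(x), log f(y)} - log f(x, y)) / (log N - min{log f(x), log f(y)}), where f(x) is the number of pages containing x, f(x, y) the number containing both terms, and N the total number of indexed pages. The function name `nwd`, its parameter names, and the page counts in the usage lines are assumptions made up for illustration; no real search engine is queried.

```python
import math


def nwd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized web distance from page counts (illustrative sketch).

    fx, fy: pages containing term x, term y (hypothetical counts)
    fxy:    pages containing both terms
    n:      total number of pages indexed (assumed known or estimated)
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Terms that never occur, or never co-occur, are treated as
        # infinitely far apart in this sketch.
        return math.inf
    if n <= min(fx, fy):
        raise ValueError("n must exceed the smaller of the two page counts")
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Made-up counts: two terms that always co-occur vs. two that rarely do.
always_together = nwd(1000, 1000, 1000, 10**6)
rarely_together = nwd(1000, 1000, 10, 10**6)
```

With these made-up counts, the pair that always co-occurs gets distance 0 and the pair that rarely co-occurs gets a larger value, which matches the intended reading of the measure.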
Normalized Information Distance
The normalized information distance is a universal distance measure for
objects of all kinds. It is based on Kolmogorov complexity and thus
uncomputable, but there are ways to utilize it. First, compression algorithms
can be used to approximate the Kolmogorov complexity if the objects have a
string representation. Second, for names and abstract concepts, page count
statistics from the World Wide Web can be used. These practical realizations of
the normalized information distance can then be applied to machine learning
tasks, especially clustering, to perform feature-free and parameter-free data
mining. This chapter discusses the theoretical foundations of the normalized
information distance and both practical realizations. It presents numerous
examples of successful real-world applications based on these distance
measures, ranging from bioinformatics to music clustering to machine
translation.
Comment: 33 pages, 12 figures, pdf, in: Normalized information distance, in:
Information Theory and Statistical Learning, Eds. M. Dehmer, F.
Emmert-Streib, Springer-Verlag, New-York, To appea
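The compression-based approximation mentioned in this abstract can be sketched in a few lines. The example below is a minimal sketch of the commonly cited normalized compression distance, NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}, where C is the compressed length. Using `zlib` as the compressor is an assumption made here for convenience, and the helper names `compressed_size` and `ncd` are hypothetical; this is not the CompLearn implementation.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings (sketch)."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Illustrative inputs: a repetitive English text and an unrelated byte pattern.
text = b"the quick brown fox jumps over the lazy dog " * 20
other = bytes(range(256)) * 4

same_pair = ncd(text, text)        # an object compared with itself
different_pair = ncd(text, other)  # two unrelated objects
```

A value near 0 suggests the two inputs share most of their structure, and a value near 1 suggests they share little. Real compressors only approximate the ideal, so results depend on the compressor and on input size.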
Sensitivity to inflectional morphemes in the absence of meaning: evidence from a novel task
A number of studies in different languages have shown that speakers may be sensitive to the presence of inflectional morphology in the absence of verb meaning (Caramazza et al., 1988; Clahsen, 1999; Post et al., 2008). In this study, sensitivity to inflectional morphemes was tested in a purpose-built task with English-like nonwords. Native speakers of English were presented with pairs of nonwords and were asked to judge whether the two nonwords in each pair were the same or different. Each pair was composed either of the same nonword repeated twice, or of two slightly different nonwords. The nonwords were created taking advantage of a specific morphophonological property of English, which is that regular inflectional morphemes agree in voicing with the ending of the stem. Thus, using stems ending in /l/, we created three conditions: (1) nonwords ending in a potential inflectional morpheme (e.g. vɔld), (2) nonwords without an inflectional morpheme (e.g. vɔlt), and (3) a phonological control condition (e.g. vɔlb).
Our new task builds on strengths of previous work. As in Post et al. (2008), the task accounts for the importance of phonological cues to morphological processing. In addition, as in Caramazza et al. (1988) and contrary to Post et al. (2008), the task never presents bare stems, making it unlikely that the participants would be aware of the manipulation performed. Our results are in line with Caramazza et al. (1988), Clahsen (1999) and Post et al. (2008), and offer further evidence that morphologically inflected nonwords take longer to discriminate than uninflected nonwords.
Effect of heuristics on serendipity in path-based storytelling with linked data
Path-based storytelling with Linked Data on the Web lets users discover concepts in an entertaining and educational way. Given a query context, many state-of-the-art pathfinding approaches aim at telling a story that coincides with the user's expectations by investigating paths over Linked Data on the Web. By taking serendipity in storytelling into account, we aim to improve and tailor existing approaches so that they better fit user expectations and users can discover interesting knowledge without feeling unsure or even lost in the story facts. To this end, we propose to optimize both the estimation of links between facts and the selection of facts in a story, by increasing the consistency and relevancy of links between facts through additional domain delineation and refinement steps. In order to address multiple aspects of serendipity, we propose and investigate combinations of weights and heuristics in paths, which form the essential building blocks of each story. Our experimental findings with stories based on DBpedia indicate improvements when the optimized algorithm is applied.
Satellites Form Fast & Late: a Population Synthesis for the Galilean Moons
The satellites of Jupiter are thought to form in a circumplanetary disc. Here
we address their formation and orbital evolution with a population synthesis
approach, by varying the dust-to-gas ratio, the disc dispersal timescale and
the dust refilling timescale. The circumplanetary disc initial conditions
(density and temperature) are directly drawn from the results of 3D radiative
hydrodynamical simulations. The disc evolution is taken into account within the
population synthesis. The satellitesimals were assumed to grow via streaming
instability. We find that the moons form fast, often within years, due
to the short orbital timescales in the circumplanetary disc. They form in
sequence, and many are lost into the planet due to fast type I migration,
polluting Jupiter's envelope with typically 15 Earth-masses of metals. The last
generation of moons can form very late in the evolution of the giant planet,
when the disc has already lost more than 99% of its mass. The late
circumplanetary disc is cold enough to sustain water ice, hence not
surprisingly 85% of the moon population has an icy composition. The
distribution of the satellite masses peaks slightly above Galilean masses,
up until a few Earth-masses, in a regime which is observable with the current
instrumentation around Jupiter-analog exoplanets orbiting sufficiently close to
their host stars. We also find that systems with Galilean-like masses occur in
20% of the cases and they are more likely when discs have long dispersion
timescales and high dust-to-gas ratios.
Comment: 15 pages, 17 figures. Accepted by MNRAS, please check the final
published versio
Evaluation of Fused Pyrrolothiazole Systems as Correctors of Mutant CFTR Protein
Cystic fibrosis (CF) is a genetic disease caused by mutations that impair the function of the CFTR chloride channel. The most frequent mutation, F508del, causes misfolding and premature degradation of CFTR protein. This defect can be overcome with pharmacological agents named "correctors". So far, at least three different classes of correctors have been identified based on the additive/synergistic effects that are obtained when compounds of different classes are combined. The development of class 2 correctors has lagged behind that of compounds belonging to the other classes. It was shown that the efficacy of the prototypical class 2 corrector, the bithiazole corr-4a, could be improved by generating conformationally locked bithiazoles. In the present study, we investigated the effect of tricyclic pyrrolothiazoles as analogues of constrained bithiazoles. Thirty-five compounds were tested using the functional assay based on the halide-sensitive yellow fluorescent protein (HS-YFP), which measures CFTR activity. One compound, having a six-atom carbocycle as the central ring of the tricyclic pyrrolothiazole system and bearing a pivalamide group at the thiazole moiety and a 5-chloro-2-methoxyphenyl carboxamide at the pyrrole ring, significantly increased F508del-CFTR activity. This compound could lead to the synthesis of a novel class of CFTR correctors.
Semantic disambiguation and contextualisation of social tags
The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-28509-7_18
This manuscript is an extended version of the paper ‘cTag: Semantic Contextualisation of Social Tags’, presented at the 6th International Workshop on Semantic Adaptive Social Web (SASWeb 2011).
We present an algorithmic framework to accurately and efficiently identify the semantic meanings and contexts of social tags within a particular folksonomy. The framework is used for building contextualised tag-based user and item profiles. We also present its implementation in a system called cTag, with which we conduct a preliminary analysis of the semantic meanings and contexts of tags belonging to the Delicious and MovieLens folksonomies. The analysis includes a comparison between semantic similarities obtained for pairs of tags in the Delicious folksonomy, and their semantic distances in the whole Web, according to co-occurrence based metrics computed with results of a Web search engine.
This work was supported by the Spanish Ministry of Science
and Innovation (TIN2008-06566-C04-02), and Universidad Autónoma de Madrid
(CCG10-UAM/TIC-5877